1  Introduction


This  paper  discusses  the  Deceit  Distributed  File  System  which  is  being  developed  at  Cornell 
University.  The  research  emphasis  of  the  Deceit  system  is  a  more  flexible  file  semantics 
to  address  the  problems  of  scalability1,  efficiency,  and  reliability.  Our  premise  is  that  it 
is  valuable  for  the  user  to  be  able  to  adjust  system  semantics  on  a  per  file  basis.  Needed 
features  may  be  employed  without  paying  a  penalty  for  unused  features.  The  default 
behavior  is  equivalent  to  NFS. 

Users  select  atypical  features  by  overriding  the  default  selections  established  by  Deceit. 
Here  are  some  of  the  choices  that  a  user  or  system  administrator  might  make: 

•  Availability — there  are  many  techniques  known  for  handling  machine  crashes  and 
communication  loss.  Different  solutions  offer  varying  levels  and  types  of  resilience 
and  cost.  For  example,  data  replication  reduces  the  probability  that  the  file  will 
become  unavailable  for  reading,  but  file  updates  become  more  expensive. 

•  Update  propagation — there  is  a  delay  between  when  a  client  issues  an  update  and 
when that update is apparent to other clients. Some applications may have constraints
on this delay.

•  Causality — constraints  may  exist  between  files  to  provide  file  consistency.  By  having 
the file system enforce such constraints, the user may augment or reduce the concurrency
of updates and the success of file caching. For example, a run-time debugger
may  require  that  an  executable  file  and  its  source  file  are  consistent. 

•  Update  stability — it  may  not  be  necessary  for  an  update  to  be  written  to  non-volatile 
storage  or  sent  to  all  replicas  immediately.  Asynchronous  update  propagation  can 
produce  dramatic  improvements  in  performance.  Note  that  an  update  can  be  visible 
to all clients before it has been delivered to all file replicas. The difference between
update  propagation  and  stability  is  clarified  below. 

The user can dynamically set file parameters to select the method that Deceit uses to
provide the above properties. This paper describes the components of the Deceit file
system which provide these file parameters and also gives a less detailed description of the
entire architecture. Some additional features that we plan to add in the near future are
also discussed.

1In this discussion scalability will refer to the total number of file servers rather than the number of
clients.

There are many existing distributed file systems including RNFS, Andrew, Sprite, Locus,
Amoeba,  SWALLOW,  DFS,  and  Cedar  (see  bibliography  for  an  extensive  list  of  references). 
We  are  deeply  indebted  to  these  other  distributed  file  systems  for  many  of  Deceit’s  design 
features.  A  full  presentation  of  other  distributed  file  systems  is  beyond  the  scope  of  this 
paper,  and  a  partial  presentation  would  inevitably  be  unfair,  so  we  only  compare  Deceit 
to  NFS  in  this  paper. 

The organization of this paper is as follows. Section 2 gives a general description of the
architecture  with  the  major  assumptions  underlying  the  Deceit  design.  Section  3  describes 
the  file  replication  management  protocols.  Section  4  describes  the  parameters  that  a  user 
can  set  on  a  file.  Section  5  is  a  detailed  description  of  the  logical  structure  of  Deceit. 
Section  6  describes  some  example  scenarios  where  Deceit  might  be  used,  and  Section  7  is 
the  conclusion. 


2  General  Architecture 

2.1  Contrast  between  NFS  and  Deceit 

One way to understand the Deceit architecture is to contrast it with the NFS architecture[?,
48,46]. In a normal NFS implementation, each server machine maintains a set of files
disjoint  from  the  sets  maintained  by  all  other  servers.  These  sets  are  structured  into 
directory  trees,  and  each  server  may  provide  more  than  one  directory  tree.  The  file  name 
space2  is  built  by  linking  together  the  directory  trees  provided  by  the  servers  into  a  single 
tree.  This  linking  is  done  separately  at  each  client.  Refer  to  Figure  1  for  an  example 
directory  tree.  Clients  may  communicate  with  any  subset  of  the  NFS  servers,  but  servers 
normally  never  communicate  with  each  other;  of  course,  servers  may  act  as  normal  clients 
to  each  other. 

Deceit and NFS use the same client/server communication protocol (i.e. the same transport
and RPC interface), so a Deceit service appears to be an NFS file service to a client. As
a result, Deceit provides a normal NFS name tree. All NFS operations are supported
with no change to any client software. The file handle is an important component of the
NFS protocol. A file handle is associated with each file or directory, and clients usually
refer to files or directories by file handle. This type of handle for files is common in the
file system literature[26,34,36]. Since Deceit uses the NFS protocol, Deceit provides file
handles. These file handles are guaranteed to be unique and usable as long as a replica of
the file exists.

2The file name space for a file system is the mapping between full path names and logical or physical
files.

A  main  difference  between  Deceit  and  NFS  is  that  files  are  not  statically  bound  to  any 
particular  server;  with  Deceit,  files  may  move  freely  between  servers.  If  a  client  request 
arrives for a file at a server which does not have that file, the request is automatically
forwarded  to  a  server  that  has  the  file.  The  reply  is  propagated  backwards  along  the  same 
path.  All  servers  provide  an  identical  file  service  to  clients  so  that  clients  have  to  explicitly 
connect  to  only  one  server  in  order  to  access  the  entire  Deceit  service.  A  comparison 
between  Deceit  and  normal  NFS  communications  paths  is  provided  in  Figure  2.  When 
one  machine  fails,  Deceit  clients  can  connect  to  another  machine  and  continue  operation; 
standard  NFS  client  software  does  not  provide  this  capability.  Users  may  think  of  Deceit 
as  a  single,  highly  reliable  and  responsive  server. 

Figure 2: Communication paths
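
The forwarding behavior described above can be illustrated with a minimal sketch. The `Server` class, its `files` dictionary, and the `peers` list are hypothetical names introduced only for illustration; they are not Deceit's actual interface, and a real server would locate the file through its group mechanism rather than by scanning peers.

```python
class Server:
    """Hypothetical Deceit-like server: serves local files, forwards the rest."""

    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})   # file handle -> contents held locally
        self.peers = []                  # other servers in the same cell

    def read(self, handle):
        """Return (contents, path), where path lists the servers traversed."""
        if handle in self.files:
            return self.files[handle], [self.name]
        # Not held locally: forward to a peer that has the file. The reply
        # travels back along the same path, so the client sees one server.
        for peer in self.peers:
            if handle in peer.files:
                contents, path = peer.read(handle)
                return contents, [self.name] + path
        raise FileNotFoundError(handle)


a = Server("a", {"h1": b"alpha"})
b = Server("b", {"h2": b"beta"})
a.peers, b.peers = [b], [a]
```

A client connected only to server `a` can read handle `h2`; the request is served by `b` and the reply returns through `a`.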


A  second  difference  is  that  Deceit  allows  replication  of  files.  Files  may  have  non-volatile 
replicas  on  any  subset  of  the  servers.  It  is  important  for  disk  and  communication  efficiency 
that  files  are  not  always  replicated  on  every  server,  since  such  a  high  degree  of  replication 
is unnecessary for most applications. To this end, the user can specify a desired replication
level and can provide explicit control over the placement of file replicas, if desired.
Directories  are  handled  similarly  to  files;  they  are  stored  on  a  subset  of  the  available  servers. 

A  third  difference  is  that  Deceit  supports  a  file  version  control  mechanism.  A  user  may 
explicitly  produce,  manipulate,  and  delete  specific  versions  of  a  file.  Also,  a  user  can  inquire 
about  the  relationships  between  versions  and  ask  Deceit  to  delete  obsolete  versions.  This 
version control mechanism is “blended” into NFS semantics so that its use is optional.

Deceit provides a superset of NFS functionality. To allow the user to access this functionality,
Deceit has additional commands and file operations beyond normal NFS operations.
Clients  access  these  features  by  using  special  RPCs  and  by  reads  and  writes  to  invisible 
control  files.  Special  commands  are  provided  to  list  all  versions  of  a  file,  locate  all  replicas 
of  a  file,  modify  file  parameters,  reconcile  directory  versions,  and  provide  other  functions. 
These  commands  are  discussed  below. 


2.2  Cells 

In  the  above  discussion,  it  was  assumed  that  all  clients  could  directly  access  any  Deceit 
server,  but  this  property  is  not  necessarily  true.  Deceit  servers  can  be  subdivided  into  cells 
to  prevent  Deceit  from  being  non-secure  (and  inefficient)  in  a  very  large  implementation. 
Each  cell  is  an  independent  instantiation  of  Deceit  with  distinct  files  and  processes.  Each 
cell maintains its own name space, and replication must be contained within a cell. A
cell  provides  security  and  administrative  boundaries.  In  our  present  implementation,  cells 
correspond  to  ISIS  site  clusters.  An  example  of  Deceit  cells  is  shown  in  Figure  3. 

Access between cells is provided through a logical directory called the global root
directory. It cannot be listed, as it implicitly contains the
full machine names of every accessible Deceit server. Instead, it is used indirectly as a
subdirectory of a normal directory. For example, if a user is in the Cornell computer science
cell and wants to access files in the MIT computer science cell, the user picks a machine
“foo.cs.mit.edu” at MIT where a Deceit server is running. By executing the command “cd
/priv/global/foo.cs.mit.edu”, a user can access the MIT cell with normal file operations.
The global root directory is a subdirectory of “/priv”. The Cornell cell acts as a client to
the MIT cell. Mount and access restrictions are applied as with any client.
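
As an illustration of how a path through the global root directory names a remote cell, the following sketch splits such a path into a server name and a path within the remote cell. The function name `split_global_path` and the constant `GLOBAL_ROOT` are assumptions made for this example; only the “/priv/global” location comes from the text.

```python
GLOBAL_ROOT = "/priv/global"   # location of the global root directory per the text


def split_global_path(path):
    """Split a path under the global root into (server_name, remote_path).

    Returns None when the path does not go through the global root
    or names no server.
    """
    prefix = GLOBAL_ROOT + "/"
    if not path.startswith(prefix):
        return None
    rest = path[len(prefix):]
    host, _, remote = rest.partition("/")
    if not host:
        return None
    return host, "/" + remote
```

For instance, `split_global_path("/priv/global/foo.cs.mit.edu/usr/doc")` yields the server name `foo.cs.mit.edu` and the remote path `/usr/doc`; the local cell would then act as a client to that server.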


2.3  Design  Assumptions 

A  list  of  assumptions  about  the  environment  where  a  system  will  be  used  is  fundamental  to 
any  design.  We  will  provide  a  short  summary  of  our  assumptions  here.  The  assumptions 
are  grouped  into  three  categories:  network  architecture,  failure,  and  typical  operational 
behavior.

Figure 3: Example configuration of Deceit servers with cells

Network Assumptions

The  target  environment  is  a  network  of  computers  used  in  a  client/server  fashion.  Some  of 
the  computers  may  be  diskless,  and  some  may  be  large  dedicated  file  servers.  We  believe 
that  NFS  offers  an  adequate  file  system  interface  for  our  purposes,  and  NFS  is  widely 
accepted,  hence  all  file  requests  from  the  clients  are  via  the  standard  NFS  interface.  Under 
normal  conditions,  all  machines  can  communicate  directly  with  each  other  through  an 
underlying network. Communication is symmetric: if a can send a message to b, then b
can send a message to a. The servers are grouped into administrative subsets called cells,
such that each cell is managed by a single centralized administration. Cells are assumed to
be a small number of local area networks (e.g. 10-100 machines). Network communication
is secure: messages are sent to the correct destination with a correct source address, and
messages cannot be examined by machines which are not the intended receiver. Since all
communication between servers is through the ISIS distributed system[2,3], all of the ISIS
communication assumptions are present in Deceit.


Failure  Assumptions 

We assume that machines may crash without notification3; messages may be lost during
transmission; and the network may experience long term communication partition.4
All  machines  have  roughly  independent  failure  probabilities.  Network  partitions  may  be 
frequent.  Within  a  cell,  servers  trust  each  other,  but  between  cells  there  is  minimal  trust. 


Operational  Assumptions 

Predictable  file  access  patterns  are  central  to  the  design  and  performance  of  Deceit.  Many 
of  Deceit’s  design  decisions  were  based  on  results  from  studies  which  were  done  in  an 
academic  environment[39,14,13,44]. 

Deceit’s  operational  assumptions  are  as  follows.  Files  tend  to  be  written  or  read  in  their 
entirety  with  a  stream  of  operations.  Nearly  simultaneous  writes  by  two  clients  to  the 
same  file  are  very  rare.  Files  experience  long  periods  of  total  inactivity  punctuated  by 
high  activity  where  they  may  be  rewritten  several  times  in  a  few  minutes.  File  activity 
tends  to  cluster  in  a  small  number  of  directories.  The  vast  majority  of  NFS  operations  are 
get  attribute  (get  basic  file  attributes),  lookup  (find  a  file  by  name  in  a  directory),  read,  and 
write.  Most  files  are  small,  i.e.  less  than  20  kilobytes. 


2.4  Related  Topics 

The  ISIS  distributed  system  is  used  for  crash  and  partition  detection,  communication 
primitives,  and  process  group  management[33,22].  Some  features  that  ISIS  provides  are: 
several group broadcast protocols, atomic group membership change, mechanisms for locating
group members by group name, light-weight processes with signals and semaphores,
architecture independent communication, and process state transfer. As a detailed discussion
of ISIS would be a digression, the reader is referred to [4] for more information.

3For a more detailed discussion of machine and communication failure models, please refer to [3].

4Some readers may be aware that early versions of ISIS blocked during network partitions. As part of
our work on Deceit, and other ISIS applications, this issue was reexamined. We expect a version of ISIS
capable of surviving network partition to be available shortly.

Throughout this paper, the term “user” will be used to refer to the person or process
who  is  initiating  file  system  operations.  In  practice,  some  file  system  operations  will  need 
to  be  restricted  beyond  normal  file  security.  For  example,  a  system  administrator  may 
mandate  that  users  can  generate  at  most  three  replicas  of  a  file.  The  separation  of  powers 
between  users  and  system  administrators  is  a  fruitful  area  of  research.  In  this  paper,  these 
distinctions  will  not  be  discussed. 


3  Replication  Management 

Replication is one of the most important mechanisms in Deceit. If critical system components
are distributed over several machines, then it is more likely that one of them will be
unavailable at any time. Replicating components is an obvious solution. There are only a
few file systems that allow files to be replicated among a set of servers[10,9,19,30,8,37].

Note, however, that caching is an important form of replication. Since caching is used in
all file systems, they all must support some form of replication. The Andrew File
System[21]  is  a  good  example.  Andrew  supports  caching  on  a  client  disk  and  in  client 
memory.  Deceit  also  supports  client  memory  caching. 

A desirable property of any replicated data system is one-copy serializability5. A file f
exhibits one-copy serializability if the results of reads and writes to f are indistinguishable
from the outcome of performing the same operations in a setting where there is only one
replica of f. One-copy serializability is a useful property since it implies that the existence
of multiple replicas is hidden at the user level.


3.1  Replica  Generation 

Associated with each Deceit file is a minimum replica level that can be defined and changed
through a special command. If file f has a minimum replica level of r, then Deceit will
ensure that there are at least r non-volatile replicas of f as long as enough servers are
available. To do so, new replicas may need to be generated. Associated with f is a server,
t, called the token holder of f. The token holder is responsible for generating and deleting
file replicas. Tokens will be discussed in more detail in Section 3.3. There are four ways
that a replica can be generated:


5One-copy serializability is defined assuming all interprocess communication is through files.

1.  The token holder t may lose contact with a replica. t counts the number of correct
replies to an update broadcast for f. If the number of replies drops below r, then
t will create new replicas. If there are no updates, replicas may become unavailable
and later available without causing a new replica to be generated.

2.  If  the  minimum  replica  level  is  increased,  t  will  create  new  replicas. 

3.  A  user  may  request  the  token  holder  t  to  create  or  delete  a  replica  on  a  specific  server 
with  a  special  command.  Users  may  inquire  about  the  current  location  of  all  replicas 
for a file with another special command.

4.  A server may request that a replica be generated in order to improve read performance.

Method 4 occurs as follows. If a client accesses file f through a server s which does not
have a replica of f, then the operation is forwarded to a server which has a replica of f. As
a background activity, a local non-volatile replica is generated on s to speed future reads
and help ensure availability. In this manner, file migration is achieved with the replication
mechanism. Each client slowly gathers its working set of files to the server to which it
has connected. In some cases, the user may prefer that a replica is not automatically
generated; this parameter may be set by the user.
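
The first generation method, in which the token holder compares the number of correct replies against the minimum replica level, can be sketched as follows. The function name and its parameters are hypothetical, and the sketch assumes that the reply count already includes every replica the token holder considers reachable.

```python
def replicas_to_create(min_level, correct_replies, spare_servers):
    """How many new replicas the token holder should create after an update broadcast.

    min_level       -- the file's minimum replica level r
    correct_replies -- number of correct replies to the update broadcast
    spare_servers   -- available servers that do not yet hold a replica
    """
    shortfall = max(0, min_level - correct_replies)
    # The guarantee only holds "as long as enough servers are available".
    return min(shortfall, spare_servers)
```

With a minimum replica level of 3 and a single correct reply, two new replicas are requested, provided at least two spare servers exist.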

Replicas  are  generated  with  a  file  transfer  protocol  from  an  existing  replica.  A  replica 
holder  feeds  a  copy  of  the  file  to  the  site  where  the  replica  is  being  generated  through  a 
TCP  connection.  Non-blocking  I/O  and  careful  buffer  management  allow  the  connection 
to  run  at  high  efficiency.  The  token  holder  delays  updates  during  replica  generation  to 
prevent  inconsistency. 

Eventually, there may exist several unneeded replicas. The token holder t will
delete these extra replicas when an update occurs [...] them. They are
deleted in least-recently-used order. The user may ask t[...] replica with a special
command.

Some existing DFSs allow files to be divided into segments for caching or replication. This
option allows finer grain control over data movement and more efficient access to very large
files. Unfortunately, it does not work well with the NFS protocol, and it greatly increases
the complexity of the system. We have decided not to provide this option at the present
time.


3.2  File Groups

For any file, f, there is an explicit process group of servers that need current information
about f, which we will call the file group of f. A process group is a set of machines
or  processes;  there  must  be  a  mechanism  for  broadcasting  messages  to  all  members  and 
sending  messages  to  individual  members.  Deceit  represents  each  file  group  with  an  ISIS 
process  group. 

The file group for f contains all servers that have a replica of the file or have cached
information about the file. This set is a superset of the replica holders, and it includes
those servers which cache only timestamps or mode bits. The fundamental operation
within a file group is update distribution. An update to f originates from a client and is
given to its server. That server then broadcasts the update to all members of f's file group;
no other servers receive this update for f. Refer to Figure 4 for a schematic description
of update distribution. The concept of a file group is fundamental to the scalability of the
entire system, since only the size of f's file group affects the speed of updates to f.
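
The role of the file group in update distribution can be illustrated with a small sketch. The `FileGroup` class and its methods are hypothetical; Deceit itself represents file groups with ISIS process groups, which this toy model does not attempt to reproduce.

```python
class FileGroup:
    """Hypothetical file group: the servers that need current information about one file."""

    def __init__(self):
        self.members = {}            # server name -> list of updates received

    def join(self, server):
        # A server must join before it may broadcast updates or hold a replica.
        self.members.setdefault(server, [])

    def broadcast(self, sender, update):
        if sender not in self.members:
            raise PermissionError(f"{sender} must join the file group first")
        for log in self.members.values():
            log.append(update)
        return len(self.members)     # cost grows with group size only


group = FileGroup()
for name in ("s1", "s2"):
    group.join(name)
delivered = group.broadcast("s1", "append 'x'")
```

Servers outside the group never see the update, which is why the cost of an update depends on the size of the file group rather than on the number of servers in the cell.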

In  Deceit,  a  server  needs  to  join  a  file  group  before  it  is  allowed  to  broadcast  an  update 
to,  or  have  a  replica  of,  that  file.  Joining  a  file  group  is  an  expensive  operation  and  may 
require  a  global  search  to  find  a  member  of  the  group.  This  operation  is  one  of  the  main 
obstacles  to  scaling  Deceit  to  an  arbitrary  size.  Deceit  limits  global  search  to  within  a 
Deceit  cell  to  ameliorate  this  problem.  As  a  result,  file  groups  must  stay  within  a  single 
cell. 

We believe that much of the complexity of distributed file systems arises in problems analogous
to those found in group management. For example, a read/write quorum protocol can
be  viewed  as  a  protocol  for  atomically  broadcasting  data  updates  to  a  group  of  replicas. 
The  problem  of  locating  a  file  replica  by  file  handle  is  similar  to  the  problem  of  locating 
a  group  member  by  group  name.  It  would  be  interesting,  but  beyond  the  scope  of  our 
present  discussion,  to  compare  file  systems  in  these  terms. 

3.3  Write-tokens 

To coordinate access to replicated data, we use a write-token protocol. This protocol is
based on one presented in [33,42]. A write-token[27,28] is associated with each file group.
Only a server that holds the token is allowed to distribute updates to the corresponding
file group. An update requires only one communication round6 if the token is held. A
write-token protocol works well when update streams tend to originate from one source for
long periods of time as in a file system; under those conditions most updates will require
only one broadcast.

Figure 4: Update Distribution (legend: replica holder; file group member)

The  token  holder  synchronously  collects  only  the  first  s  correct  replies,  where  s  is  the  write 
safety  level  of  the  file.  After  these  s  replies  have  been  collected,  the  original  client  RPC 
that  requested  the  update  will  return. 

A server that lacks a token must acquire it before distributing an update for that file.
Token acquisition requires one round, but it is only done for the first in a series of updates.
To acquire a token, a server broadcasts a token request to that file group. The server that
holds the token broadcasts a token pass in response. It is necessary for correctness that
the updates arrive in identical order at all servers regardless of token movement.

6A communication round is the distribution of a message to a set of processes. The collection of synchronous
replies is included in the round.

There are several optimizations to this protocol. One optimization is to broadcast an
update in the same message with a token request; replica holders execute those updates
upon receiving the corresponding token pass. Another optimization is to pass an update
to  the  current  token  holder  instead  of  requesting  the  token  if  it  is  likely  that  there  will 
be  only  one  update;  for  example,  a  small  file  that  is  overwritten  in  a  single  update  will 
probably  not  be  updated  again  soon.  Deceit  currently  uses  neither  of  these  optimizations. 
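
The basic write-token protocol, without either optimization, can be sketched as a round-counting model. The `TokenGroup` class and its fields are hypothetical names; the model assumes that all replicas reply correctly and ignores failures.

```python
class TokenGroup:
    """Hypothetical file group with a single write-token."""

    def __init__(self, servers, holder, safety_level):
        self.replicas = {s: [] for s in servers}   # server -> ordered update log
        self.holder = holder                       # current token holder
        self.safety_level = safety_level           # write safety level s
        self.rounds = 0                            # communication rounds used

    def update(self, server, data):
        """Apply an update issued through `server`; return replies awaited."""
        if server != self.holder:
            # Token request + token pass: one extra round, paid only for
            # the first update in a series from this server.
            self.rounds += 1
            self.holder = server
        # Only the holder distributes, so all logs see the same order.
        self.rounds += 1
        for log in self.replicas.values():
            log.append(data)
        # The client RPC returns after the first s correct replies.
        return min(self.safety_level, len(self.replicas))


g = TokenGroup(["s1", "s2", "s3"], holder="s1", safety_level=2)
g.update("s2", "u1")   # acquires the token, then distributes
g.update("s2", "u2")   # token already held: one round
```

In this model the first update through `s2` costs two rounds and each later update in the same stream costs one, which matches the claim that most updates require only one broadcast when streams originate from one source.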


3.4  Global  One-copy  Serializability 

The write-token scheme described in Section 3.3 is sufficient to achieve one-copy serializability
in a file system containing only one file, but a real file system will require a stronger
mechanism. Global one-copy serializability is defined as the property that clients should
observe one-copy serializability on the whole file system rather than simply on individual
files. A related property is real-time consistency: if one user writes a file and calls a friend
on the phone, the friend should be able to observe the update within a bounded delay.

Global one-copy serializability is stronger than simple one-copy serializability as is shown
in Figure 5. In this example, files x and y are initially empty. Client c1 appends to x and
then appends to y. Concurrently, client c2 successfully reads from y and then observes
that x is empty. This result is impossible if there is only one replica of x and y. Yet x and
y separately exhibit one-copy serializability.
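
The impossibility claimed in this example can be checked mechanically. The sketch below enumerates every interleaving of the two clients' operations against a single copy of each file and records what the second client could observe; the function names are introduced here for illustration only.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def one_copy_outcomes():
    """All (y_seen_nonempty, x_seen_nonempty) results c2 can get with one replica."""
    c1 = [("append", "x"), ("append", "y")]
    c2 = [("read", "y"), ("read", "x")]
    outcomes = set()
    for schedule in interleavings(c1, c2):
        files = {"x": "", "y": ""}
        seen = {}
        for op, name in schedule:
            if op == "append":
                files[name] += "data"
            else:
                seen[name] = bool(files[name])
        outcomes.add((seen["y"], seen["x"]))
    return outcomes
```

The outcome in which y is seen non-empty while x is seen empty never appears in the result, so a system that allows it cannot be globally one-copy serializable even though each file taken alone behaves correctly.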

An  obvious  solution  is  to  wait  for  the  update  to  complete  at  every  member  of  the  file 
group  before  allowing  a  write  call  to  return  to  a  client.  Unfortunately,  this  can  lead  to  bad 
performance,  particularly  in  the  case  where  a  replicated  file  is  being  written  with  a  stream 
of small updates. A more efficient mechanism that allows updates to complete concurrently
is  called  for. 

Deceit provides global one-copy serializability with a stability notification mechanism. Before
a file can be modified, all members of the file group are notified that the file is unstable.
All available7 replicas must be so notified before any updates can occur (the failure of the
token holder during stability notification is discussed in Section 3.5). After stability notification,
all file reads and inquiries are forwarded to the token holder. Only the token
holder’s replica needs to be updated before a write can return to a client. One-copy
serializability is guaranteed because, in effect, the token holder’s replica is now the “primary”
replica. After a short period of no write activity, the token holder notifies all other
members of the group that the file is stable again. Table 1 gives a short summary of the
sequence of events required in a normal update. Stability notification is normally invisible
to applications, and its main effect is on performance and update visibility to clients.

7A replica at server b is available to a if a can communicate with b. ISIS provides a clean notion of
availability since failure detection is coordinated with communication.

Figure 5: Illustration of One-copy Serializability Example (upper panel: time history of operations with files x and y; lower panel: separate time histories of x and y)

The main benefit of stability notification is that updates become visible to all clients
simultaneously.8 On the other hand, an overhead is incurred at the beginning and end
of a stream of updates. This overhead can be expensive if updates are short and rare.
Also, reads that are concurrent with updates are more expensive. By default, Deceit uses
stability notification, but the client can specify that stability notification is not used.

8In a distributed system, simultaneous events may not appear to happen at the same physical time since
communication delay introduces uncertainty.

     Precondition                           Action
1    token is not held                      acquire token
2    replicas are not marked as unstable    mark replicas as unstable
3    true                                   distribute update
4    failure detected                       count update replies
5    insufficient replicas                  generate new replicas
6    period of no write activity            mark replicas as stable

Table 1: Typical Sequence of Events in an Update
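
The first five rows of Table 1 can be read as a guarded sequence of actions performed for each update. The following sketch encodes that reading; the function name and boolean flags are hypothetical, and the flags are treated as independent inputs, which simplifies how a real server would derive them. Row 6 (marking replicas as stable) is omitted because it happens later, after a period of no write activity.

```python
def update_actions(token_held, marked_unstable, failure_detected=False,
                   insufficient_replicas=False):
    """Return the ordered actions the issuing server performs for one update."""
    actions = []
    if not token_held:
        actions.append("acquire token")
    if not marked_unstable:
        actions.append("mark replicas as unstable")
    actions.append("distribute update")          # precondition: true
    if failure_detected:
        actions.append("count update replies")
    if insufficient_replicas:
        actions.append("generate new replicas")
    return actions
```

In the common case of a write stream from one server, the token is already held and the replicas are already marked unstable, so each update consists of the single distribution step.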

3.5  Crash  and  Partition  Failures 

The  algorithm  presented  in  Sections  3.1  to  3.4  must  be  resilient  to  failure.  The  mechanisms 
that  Deceit  uses  for  this  purpose  are  presented  below. 


Histories  and  Version  Pairs 

Associated implicitly with each replica of file f is an update history f.h. An update history
is a list of all updates to the file and which server issued these updates. History f.h is an
ancestor of history f.h' if f.h is a prefix of f.h'. The histories that all replicas of a file pass
through form a tree under the ancestor relation; this tree is called the history tree. Two
histories are incomparable if neither is an ancestor of the other.

Deceit does not explicitly store the full history of a replica. Instead, Deceit maintains a
one-to-one mapping from histories to integer pairs $(v_1, v_2)$ where $v_1$ is the major version
number, and $v_2$ is the subversion number9. $v_2$ is incremented on every update, and $v_1$ is
changed to a new unique number every time there is a potential branch in the history
tree. These branch points are recorded with a replica so that version number pairs can be
compared as if the histories that they represent were available. For example, the relation
$(v_1 = v_1' \wedge v_2 < v_2') \Rightarrow (f.h \text{ is an ancestor of } f.h')$ always holds.

9In the literature, this value is often called an update counter.

A  version  pair  is  stored  with  each  write  token.  The  token  version  pair  can  be  compared 
to  a  replica  version  pair  to  quickly  decide  if  a  replica  has  received  every  update  through 
that  token.  This  version  pair  is  available  to  the  user  through  a  special  command  so  that 
the  user  can  determine  if  a  file  has  been  modified. 
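
One way to compare version pairs using recorded branch points is sketched below. The layout of the `branches` table (new major version mapped to the parent major version and the subversion at the branch point) is an assumption made for illustration; the text does not specify how Deceit stores branch points. The sketch also treats the history at a branch point as an ancestor of every pair on the new branch.

```python
def is_ancestor(a, b, branches):
    """True if the history named by pair a precedes the history named by pair b.

    a, b     -- (major, sub) version pairs
    branches -- hypothetical branch record: new major -> (parent major,
                subversion at the branch point)
    """
    major, sub = b
    limit = sub - 1                   # largest subversion on this major that precedes b
    while True:
        if a[0] == major:
            return a[1] <= limit
        if major not in branches:
            return False
        major, limit = branches[major]


branches = {7: (1, 3)}   # major version 7 branched from major version 1 at subversion 3
```

With this table, the pair (1, 2) is an ancestor of (7, 4), while (1, 4) and (7, 4) are incomparable because both extend the history at (1, 3) in different directions.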


Token  Generation 

If a client wishes to update a file, and no write token is available for the specified version,
a new token must be generated. Assume that the file being updated has major version
number $v_1$, and a replica is available with version pair $(v_1, v_2)$. Server s can generate a
new token by picking a globally unique major version number $v_1'$ and building a token with
version pair $(v_1', v_2)$; then s stores $v_1$ with the new token. Replicas corresponding to the
new token are generated by copying the original replica.

Generating  a  new  token  is  more  than  simply  generating  a  new  version  pair.  Every  file 
replica  is  associated  with  only  one  token.  The  new  token  represents  a  distinct  new  file 
with  a  distinct  set  of  replicas.  After  the  new  token  is  generated,  enough  replicas  are 
generated to satisfy the minimum replica level constraint. File data is drawn from the
existing  available  replica. 
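
Token generation as described above might be sketched like this. The source does not say how Deceit picks globally unique major version numbers, so the scheme below (a local counter interleaved with a server identifier) is purely an assumption, as are the names `Server`, `Token`, `fresh_major`, and `max_servers`.

```python
from dataclasses import dataclass
from itertools import count
from typing import Tuple


@dataclass(frozen=True)
class Token:
    version_pair: Tuple[int, int]  # (new major v1', unchanged subversion v2)
    branched_from: int             # old major v1, stored with the new token


class Server:
    def __init__(self, server_id: int, max_servers: int = 1024):
        if not 0 <= server_id < max_servers:
            raise ValueError("server_id out of range")
        self.server_id = server_id
        self.max_servers = max_servers
        self._counter = count(1)

    def fresh_major(self) -> int:
        # Assumed uniqueness scheme: distinct servers can never collide
        # because the server id occupies the low-order part of the number.
        return next(self._counter) * self.max_servers + self.server_id

    def generate_token(self, replica_pair: Tuple[int, int]) -> Token:
        old_major, sub = replica_pair
        return Token((self.fresh_major(), sub), branched_from=old_major)
```

Here a server with id 2 that branches from a replica at pair (3, 5) produces a token with pair (1026, 5) and records 3 as the major version it branched from.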

It may be necessary to constrain when a token can be generated. Deceit provides file
parameter settings that provide this capability. There are three options. The first option
is  to  totally  inhibit  the  generation  of  new  write-tokens.  This  option  has  the  advantage  that 
a  server  can  always  write  to  a  file  after  it  has  acquired  the  write-token,  but  it  is  easy  to 
suffer  long  term  loss  of  file  availability.  The  second  option  is  to  allow  a  server  to  generate 
or  use  a  token  only  if  the  majority  of  the  replicas  are  available.  A  token  becomes  disabled 
if  the  majority  of  the  replicas  becomes  unavailable.  This  option  provides  relatively  high 
availability,  and  multiple  versions  can  be  generated  only  during  transitional  periods.  On 
the  other  hand,  it  is  more  difficult  to  implement,  and  write  availability  may  be  lost  in  the 
middle of a stream of updates. The third option is to not restrict token generation at all.
Deceit uses the second option as the default.

Restricting  updates  to  the  majority  partition  requires  a  mechanism  for  counting  the  number 
of  available  replicas.  Replicas  are  normally  counted  by  counting  the  correct  replies  to  an 
update broadcast. All replica generation must be accomplished through the token holder,
so that the token holder always has an upper bound on the total number of replicas. For
purposes  of  computing  a  majority,  the  total  number  of  replicas  is  taken  to  be  the  maximum 
of  the  minimum  replica  level  and  the  upper  bound  on  the  number  of  replicas.  For  a  server 
without  access  to  the  token,  the  total  number  of  replicas  is  assumed  to  be  the  minimum 
replica  level;  the  number  of  available  replicas  is  determined  by  broadcasting  an  inquiry  to 
the  file  group. 
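
The majority computation above can be written out as a short sketch. The function names are hypothetical, and "majority" is assumed here to mean strictly more than half of the assumed total.

```python
from typing import Optional


def assumed_total_replicas(min_replica_level: int,
                           upper_bound: Optional[int] = None) -> int:
    """Total replica count used for the majority test.

    `upper_bound` is the token holder's bound on the number of replicas;
    pass None for a server without access to the token.
    """
    if upper_bound is None:
        return min_replica_level
    return max(min_replica_level, upper_bound)


def has_majority(available: int, min_replica_level: int,
                 upper_bound: Optional[int] = None) -> bool:
    """True if the available replicas form a strict majority (assumption)."""
    total = assumed_total_replicas(min_replica_level, upper_bound)
    return 2 * available > total
```

With a minimum replica level of 3, two available replicas are a majority for a server without the token, but not for a token holder that knows of up to 5 replicas.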


Version  Control  System 

During a partition event, multiple file versions can be generated. It follows that Deceit
must be capable of maintaining distinct versions which are distinguished only by different
values for the major version number, $v_1$. The facility by which this is accomplished may also
be accessed directly at the user level as a normal file versioning system, such as in a source
code management system. Deceit uses a simple mechanism: file names can be qualified
with version numbers using a special syntax. For example, major version 3 of “foo” can be
referred to as “foo;3.” By using this form of file name, specific versions can be created^10,
modified, and deleted. By using an unqualified filename, the user automatically requests
the most recent available version. A directory entry actually uses the unqualified filename,
so creating a new file version does not require an update to a directory. The system behaves
similarly to the VAX/VMS^11 [11] version control system, except that VMS produces a new
version on every file update, while Deceit produces new versions only during partitions or
when explicitly requested.
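
As an illustration of the qualified-name syntax, the following sketch splits a name such as "foo;3" into its base name and major version. The function name is hypothetical, and the handling of edge cases (for example, a name whose suffix is not a number) is an assumption, since the source only gives the one example.

```python
from typing import Optional, Tuple


def parse_versioned_name(name: str) -> Tuple[str, Optional[int]]:
    """Split "foo;3" into ("foo", 3); unqualified names get version None."""
    base, sep, suffix = name.rpartition(";")
    if sep and base and suffix.isascii() and suffix.isdigit():
        return base, int(suffix)
    # No usable version suffix: treat the whole string as an unqualified
    # name, which requests the most recent available version.
    return name, None
```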


Local  Non-volatile  Storage 

Several  types  of  information  must  be  kept  in  non-volatile  storage  to  allow  recovery  from 
a  crash.  Each  server  stores  all  file  data  for  its  replicas.  This  data  includes:  the  actual 
data  of  the  file,  the  replica  state,  and  the  version  pair.  Additionally,  each  server  stores 
all state information relating to each token that is held. Also, each server stores a non-volatile
copy of the map between file handles and local file names. Some of a server’s
non-volatile  storage  is  updated  immediately  when  values  change,  and  some  of  it  is  written 
asynchronously,  depending  on  safety. 

^10 Deceit selects major version numbers carefully to ensure global uniqueness. Users must be careful when
creating new versions during a partition to preserve uniqueness.

^11 VAX/VMS is a trademark of Digital Equipment Corporation.
3.6 Crash Scenarios

In  order  to  clarify  the  usage  of  the  crash  resilience  mechanisms,  several  example  scenarios 
are  presented. 


Non-token  Replica  Crash 

When a server s recovers from a crash, it contacts the token holder for each file f such
that s has a replica but no token for f. Each token carries the version pair that replicas
should have if they are up to date. If s finds that it has an obsolete replica of f, s destroys
it. Since the history for the replica of s is a prefix of the history associated with the token,
no update will be lost.


Token  Crash 

Token  loss  is  detected  when  a  server  attempts  to  contact  a  token  holder  during  the  course 
of  normal  read  or  update  operations.  Let  us  assume  that  server  s  needs  to  distribute 
an update for a file, but it cannot contact the current token holder. Subject to token
generation  constraints,  s  can  generate  a  new  token.  Since  s  now  holds  the  token  for  a 
version  of  the  file,  s  can  complete  the  original  operation. 

Assume  that  s  could  not  contact  the  old  token  holder  s'  because  s'  had  crashed.  When 
s' recovers, it will be notified about the creation of the new version during its recovery
protocol. s' will note that the new version is a direct descendant of the old version and
destroy the old version and all of its replicas.


Partition 

Now  consider  the  scenario  where  there  was  a  network  partition,  but  no  updates  were  issued 
to  the  file  in  the  partition  with  the  token.  Read  access  on  the  token  holder  side  continues 
normally, since it is difficult to distinguish between this scenario and the case where the other
replicas  simply  crashed.  Write  access  in  the  partition  which  does  not  contain  the  old  token 
may  cause  a  new  token  to  be  generated.  When  the  partition  is  resolved,  the  old  token 
holder  will  be  notified.  It  will  appear  to  the  clients  as  if  the  token  had  actually  been 
moved,  and  the  updates  were  propagated  very  slowly  to  some  servers. 




The  hard  case  is  when  a  partition  occurs  and  updates  are  issued  to  the  file  on  both  sides 
concurrently.  In  this  case  both  of  the  incomparable  versions  of  the  file  are  kept,  and  a 
notification  is  logged  into  a  well  known  file.  It  is  the  responsibility  of  the  user  to  resolve 
such  conflicts.  By  allowing  the  user  to  resolve  incomparable  versions,  the  semantics  of  the 
file  may  be  used  for  resolution  [5,52,20].  Both  versions  are  made  available  to  the  user  and 
may  be  edited,  modified,  or  deleted  independently.  Since  concurrent  updates  are  assumed 
to  be  rare,  this  case  should  occur  very  rarely. 


Stability  Notification  in  the  Presence  of  Failure 

If the token holder t for a file f loses contact with some of f's replicas during an update
distribution, those replicas might be left in an inconsistent state. Stability notification is
used to detect this case. Before an update is distributed, all available replicas are marked
as unstable. Therefore, if replica states are inconsistent, then all inconsistent replicas will
be marked as unstable.

Inconsistency is detected when a read is given to a server s which has an unstable replica
of f, and s is unable to contact t. In order to respond to a read, s must locate a stable
replica. s produces a stable replica by broadcasting to f's file group to determine the state
of all available replicas. If there is a stable replica at server s', the operation is forwarded
to s'. If no replica is marked as stable, s forces the most up to date replica to be stable,
and all obsolete replicas are destroyed.
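
The read-time recovery step can be sketched as a small selection function. This is only an illustration: `ReplicaState` and `choose_read_replica` are hypothetical names, the list of replies is assumed to be non-empty (it includes the reader's own replica), and all replicas are assumed to share one major version so that the subversion number alone identifies the most up to date replica.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplicaState:
    server: str
    version_pair: Tuple[int, int]  # (major, subversion)
    stable: bool


def choose_read_replica(
    replies: List[ReplicaState],
) -> Tuple[ReplicaState, List[ReplicaState]]:
    """Return (replica to serve the read, replicas to destroy as obsolete)."""
    stable = [r for r in replies if r.stable]
    if stable:
        # Forward the read to any server holding a stable replica.
        return stable[0], []
    # No stable replica is reachable: force the newest one to be stable.
    newest = max(replies, key=lambda r: r.version_pair[1])
    newest.stable = True
    obsolete = [r for r in replies
                if r.version_pair[1] < newest.version_pair[1]]
    return newest, obsolete
```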


Disastrous  Failure 

Despite  all  of  these  precautions,  with  a  suitably  pathological  sequence  of  crashes  and 
recoveries, it is still possible to violate one-copy serializability. For example, if an
obsolete  file  replica  recovers  and  all  other  replicas  simultaneously  crash,  the  file  will  appear 
to  go  back  in  time.  We  could  solve  this  problem  by  inhibiting  token  generation  and  by 
consulting  the  token  holder  during  every  operation,  but  this  solution  would  destroy  most 
of  the  benefit  of  replication. 
4 File Semantics

Deceit  associates  the  following  semantic  parameters  with  each  file: 

1. Minimum Replica Level - the minimum number of valid replicas that must be maintained.
For example, a minimum replica level of 3 would force Deceit to maintain a
valid replica on at least 3 separate servers. By default, the value is 1.

2.  Write  Safety  Level  -  the  number  of  replica  servers  that  must  reply  to  an  update 
before  a  write  RPC  returns  to  a  client.  A  value  of  0  produces  asynchronous  unsafe 
writes;  a  value  greater  than  or  equal  to  the  number  of  available  replicas  produces 
slow  and  fully  synchronous  writes.  By  default,  the  value  is  1. 

3. Stability Notification - specifies whether stability notification is to be used. Stability
notification guarantees global one-copy serializability and real-time update propagation,
but there is a performance cost. The default is to use stability notification.

4. File Migration - specifies whether Deceit should automatically attempt to create a non-volatile
replica of file f on a server that receives requests from a client for f. For some applications,
it may be undesirable to automatically generate local replicas. For example, for a very large
data file, generating a local replica may consume too much disk space. The default
is that file migration is not used.

5. Write Availability Level - determines when Deceit can generate a new write-token if
a token has been lost. If this parameter is set to “high,” then a token may be generated
whenever one is needed. High availability makes it likely that multiple file
versions will result from a partition. A value of “medium” allows a new token to be
generated by server s only when s can contact a majority of the replicas, and a token
is disabled if fewer than a majority are available. As a result, some replicas may
occasionally be “read only,” but multiple file versions will occur less frequently. A
value of “low” prevents the production of additional tokens. Loss of file write access
may be frequent and long term, but there is no chance of generating multiple
versions. The default value is “medium.”
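
The five parameters and their stated defaults can be collected into a small record, sketched below. The field and function names are hypothetical; only the default values and the meaning of each parameter come from the list above. The helper `replies_to_await` reflects the statement that a write safety level at or above the number of available replicas yields fully synchronous writes.

```python
from dataclasses import dataclass

WRITE_AVAILABILITY_LEVELS = ("low", "medium", "high")


@dataclass
class SemanticParams:
    min_replica_level: int = 1           # parameter 1
    write_safety_level: int = 1          # parameter 2
    stability_notification: bool = True  # parameter 3
    file_migration: bool = False         # parameter 4
    write_availability: str = "medium"   # parameter 5

    def __post_init__(self) -> None:
        if self.min_replica_level < 0 or self.write_safety_level < 0:
            raise ValueError("levels must be non-negative")
        if self.write_availability not in WRITE_AVAILABILITY_LEVELS:
            raise ValueError("write_availability must be low, medium, or high")


def replies_to_await(params: SemanticParams, available_replicas: int) -> int:
    """Replica replies a write waits for before returning to the client."""
    return min(params.write_safety_level, available_replicas)
```

A write safety level of 0 waits for no replies (asynchronous, unsafe writes), while a level of 5 with only 3 available replicas waits for all 3.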
5 System Components

The  Deceit  server  consists  of  two  components  as  shown  in  Figure  6.  The  first  component 
is  a  distributed  reliable  segment  server.  The  segment  server  provides  a  simple,  flat,  reliable 
distributed  file  service  with  no  user  level  security  or  user  specified  names.  There  is  no 
notion  of  directories  or  links  in  the  segment  server.  The  segment  server  implements  all  of 
the  update,  replication,  and  versioning  protocols,  and  it  is  the  layer  where  file  parameters 
exist. On top of the segment server is a full NFS file service, called the NFS file service
envelope, which uses the segment server for storage and communication.


Figure  6:  Expanded  view  of  Deceit  architecture 


Deceit  does  not  directly  address  most  security  issues.  It  is  assumed  that  communication 
between  instances  of  the  segment  server  is  secure  (e.g.  encrypted  or  physically  secure). 
Also,  the  local  files  used  for  storage  by  the  segment  server  are  inaccessible  to  unauthorized 
users. Client/server communication is secured, and client authentication is provided using
DES  encryption  in  the  NFS  interface.  It  is  beyond  the  scope  of  this  discussion  to  provide 
a  detailed  description  of  these  mechanisms. 

5.1  Segment  Server 

A  segment  contains  an  array  of  bytes  that  can  be  indexed  by  an  offset.  Associated  with 
a  segment  is  the  data  in  the  segment,  the  values  for  each  of  the  semantic  parameters,  a 
version  number  pair,  a  process  group,  and  read  and  write  timestamps.  The  interface  to 
the segment server consists of five normal procedure calls: create, delete, read, write, and
setparam.  Create  has  no  arguments  and  simply  returns  a  handle  for  a  new  segment  of  zero 
length.  Delete  takes  a  segment  handle  and  deletes  all  storage  allocated  for  it.  Read  reads 
a  portion  of  a  segment.  Write  modifies  a  segment  by  replacing,  appending,  or  truncating 
data  in  the  segment.  Setparam  is  used  to  specify  semantic  parameters  on  a  segment. 

It is valuable to have a form of serial transaction^12 when accessing a segment. Many
distributed file systems, particularly earlier ones, had strong mechanisms for executing
an atomic transaction. For example, DFS[45] had an extensive transaction mechanism
which involved locking and intention lists. SWALLOW[50,49] used a system of virtual
time stamps to provide serializability. The Alpine File System[6] used log files to provide
recoverability. Later distributed file systems had weaker atomicity guarantees in order to
provide better performance.

To  help  provide  a  transactional  capability,  Deceit  employs  the  version  number  pair.  A  read 
call  not  only  returns  data,  but  it  also  returns  the  version  pair  associated  with  that  data. 
A write also returns the version pair of the segment after the write has completed. A write
call can also have a version pair as a parameter; in this case the write will succeed only
if  the  version  pair  of  the  segment  matches  the  version  pair  in  the  call  when  the  data  is 
actually  updated,  otherwise  an  error  will  be  returned.  Using  this  feature,  a  limited  type 
of  serial  transaction  can  be  achieved. 

This notion of version pairs can be used to implement an optimistic concurrency control
mechanism. A read implicitly begins a transaction with the establishment of a specific
version pair for the segment. Writes are then issued with that version pair. A transaction
is implicitly completed when the last write completes successfully. A write which returns
an error due to the use of the wrong version pair is similar to a transaction which has
been aborted due to the detection of non-serial behavior. Unfortunately, there is no way
to back out of previous updates. The application which attempted the write can restart
with the original read or abandon the attempt entirely. In effect, transactions have been
transformed into lightweight tasks which run at the level above the segment server.

^12 An atomic transaction has two properties: recoverability and serializability. Recoverability means that the
transaction completes or fails entirely. Serializability means that the transactions exhibit behavior consistent
with some total ordering. A serial transaction is a transaction that only provides serializability.

A  good  example  of  this  behavior  is  the  addition  of  an  entry  to  a  directory  in  the  Deceit 
file  system.  The  directory  is  read,  and  a  position  is  selected  in  the  directory  for  adding  the 
entry;  this  may  overwrite  an  existing  dead  entry  or  append  a  new  entry.  Then,  an  update 
is  given  to  the  segment  server  with  the  version  pair  returned  by  the  original  read.  If  a 
version  pair  conflict  occurs,  the  whole  operation  is  restarted. 
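
The read, conditional write, and restart cycle for adding a directory entry can be sketched with an in-memory stand-in for a segment. Everything here is illustrative: `Segment`, `VersionConflict`, and `add_entry` are hypothetical names, the directory is assumed to be stored as newline-separated names, and the retry limit is an arbitrary choice.

```python
from typing import Optional, Tuple

VersionPair = Tuple[int, int]


class VersionConflict(Exception):
    """Raised when a conditional write names a stale version pair."""


class Segment:
    def __init__(self, major: int = 1):
        self.data = b""
        self.pair: VersionPair = (major, 0)

    def read(self) -> Tuple[bytes, VersionPair]:
        return self.data, self.pair

    def write(self, data: bytes,
              expected: Optional[VersionPair] = None) -> VersionPair:
        # A write that carries a version pair succeeds only if it matches.
        if expected is not None and expected != self.pair:
            raise VersionConflict(self.pair)
        self.data = data
        self.pair = (self.pair[0], self.pair[1] + 1)  # one more update
        return self.pair


def add_entry(segment: Segment, name: bytes,
              max_attempts: int = 5) -> VersionPair:
    for _ in range(max_attempts):
        data, pair = segment.read()            # implicitly begins the transaction
        entries = data.split(b"\n") if data else []
        entries.append(name)
        try:
            return segment.write(b"\n".join(entries), expected=pair)
        except VersionConflict:
            continue                           # restart from the original read
    raise VersionConflict("gave up after repeated conflicts")
```

Because each attempt rereads the directory before writing, an update made by another writer between the read and the write is never silently overwritten; the conflicting attempt simply starts over.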


5.2  NFS  File  Service  Envelope 

The  full  file  service  is  built  on  top  of  the  reliable  segment  server.  The  principle  is  that 
every  file,  directory,  or  soft  link  is  mapped  into  a  unique  segment.  All  NFS  operations 
are  mapped  into  creates,  deletes,  reads,  and  writes  on  segments.  The  UNIX  kernel  does  a 
similar  transformation  when  it  transforms  user  file  operations  into  disk  operations.  The  full 
file  service  inherits  the  distributed  nature  of  the  segment  server  to  provide  a  full  distributed 
file  system.  Although  the  NFS  envelope  implementation  is  a  large  piece  of  software,  it  is 
totally  independent  of  the  underlying  implementation  of  the  segment  service.  In  principle, 
it  will  never  need  to  be  changed  despite  radical  changes  in  the  segment  server  protocols. 


Links  and  Garbage  Collection 

Since  the  segment  server  does  not  have  a  notion  of  links,  it  is  the  responsibility  of  the 
NFS  envelope  to  decide  when  a  file  is  no  longer  accessible,  so  the  storage  for  it  can  be 
deallocated.  Since  both  directories  and  files  may  have  multiple  versions,  and  since  creation 
of  new  links  may  be  hidden  behind  partitions,  it  can  be  very  difficult  to  decide  when  to 
delete  the  segment  for  a  file  [29].  A  normal  NFS  system  has  the  advantage  of  centralized 
control and single file versions, so a simple link count suffices (however, the link counts can
be corrupted by an ill-timed crash).

We  see  two  solutions  to  the  problem.  The  first  solution  is  to  extend  the  concept  of  a  link 
count.  The  link  count  would  correspond  to  the  total  number  of  link  copies,  where  every 
replica of every version of a directory referring to the file is counted once. Refer to Figure 7
for  an  example  of  this  type  of  link  count.  The  link  count  is  stored  in  the  segment  as  normal 
data.  To  maintain  safety,  it  would  be  impossible  to  add  a  link  to  a  file  unless  the  file  was 
write  available.  Unfortunately,  it  would  also  be  unsafe  to  create  new  replicas  or  versions 
of  directories  unless  each  file  in  the  directory  was  also  write  available.  Another  severe 
disadvantage  is  that  when  the  link  count  becomes  corrupted,  it  is  extremely  expensive  (or 
impossible)  to  recalculate. 


[Figure 7 diagram not recoverable from the source text. Legend: X = hard link to file. The total link count in the example is 9.]


Figure  7:  Example  of  Link  Count  Computation 


The solution we chose is more complex. An uplink list of directory file handles is stored
with each file. The NFS envelope attempts to maintain the property that if file f is in
directory d, then d is in the uplink list of some version of f. When a hard link is made to
f in directory d, d is added to the uplink list of all versions of f which can be updated at
that time. Deceit also keeps a standard hard link count with f, but it is only considered
to be a hint. When the link count goes to zero, the NFS envelope checks every available
version of every directory in the uplink list. If none have a link to the file, the segment is
deallocated; otherwise, the link count is corrected.
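
The check performed when the link count reaches zero can be sketched as follows. The names and data layout are hypothetical: each directory handle is assumed to map to a list of its available versions, each represented as the set of file handles it links to. The source does not say what value the link count is corrected to; counting the directories that still refer to the file is an assumption made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_for_collection(
    file_handle: str,
    link_count: int,
    uplink_list: Iterable[str],
    available_versions: Dict[str, List[Set[str]]],
) -> Tuple[bool, int]:
    """Return (deallocate?, link count to store)."""
    if link_count > 0:
        return False, link_count  # the hint says the file is still linked
    # Check every available version of every directory in the uplink list.
    referencing = [
        d for d in uplink_list
        if any(file_handle in entries
               for entries in available_versions.get(d, []))
    ]
    if not referencing:
        return True, 0
    # The hint was wrong: correct it (assumed correction rule).
    return False, len(referencing)
```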

Our solution has several drawbacks. A file f may neither be moved nor have additional links
made to it unless the uplink list of some version of f can be safely modified. Also, a server s
may add an uplink to f, but another server s' may never see that uplink if s' can only
contact a disjoint set of versions. As a result, s' may prematurely attempt to deallocate f.
Another drawback is that if the link count of f is corrupted so that it is too large, f may
never be garbage collected. Finally, when a file is moved, two directories, a link count, and
an uplink list must be modified in some safe order. Garbage collection is discussed again
briefly in Section 7.


5.3  Client  Agents 

The  agent  is  a  simple  but  important  component  of  Deceit.  The  agent  is  the  client  software 
which  interfaces  between  the  user  process  and  the  NFS  protocol.  Currently,  the  agent  runs 
in  the  kernel,  but  the  agent  can  be  in  several  possible  locations.  Refer  to  Figure  8.  These 
different  configurations  provide  widely  differing  performance. 


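[Figure 8 diagram not recoverable from the source text. Surviving labels, in order:
connection types: normal procedure call, interprocess communication, kernel call, or remote procedure call;
agent forms: user loadable library, kernel procedure, or auxiliary user process;
connection types: local interprocess communication or remote procedure call (NFS).]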


Figure  8:  Example  Agent/Server  Configurations 


The agent serves two primary functions. First, the agent provides caching. The agent
caches file and directory data as well as information specific to the client/server protocol
such as NFS file handles and server information. The second agent function in Deceit is
failover. When one server fails, the agent must select another to continue operation. This
second capability requires an extension to the NFS protocol.

A third optional agent function is using an access shortcut. Normally, a server forwards
any request for which it does not have a replica. It is more efficient for the agent
to  cache  file  locations  and  directly  communicate  with  the  correct  servers.  This  capability 
requires  an  extension  to  the  NFS  protocol. 

Deceit  currently  uses  the  standard  NFS  client  software  provided  in  the  Sun  operating 
system.  This  software  does  not  provide  failover  or  shortcuts.  A  new  agent  is  being  written 
which  will  run  as  an  auxiliary  user  process,  and  it  will  provide  full  functionality.  An  agent 
which  can  be  loaded  as  a  user  library  and  directly  issues  NFS  RPCs  is  planned,  and  this 
agent  should  greatly  improve  file  performance. 

A  user  library  that  acts  as  an  agent  is  not  easily  implemented  for  several  reasons.  First, 
since  the  NFS  protocol  requires  file  handles  and  there  is  no  way  to  easily  extract  file  handles 
from  the  SunOS  kernel,  each  user  process  will  have  to  go  through  the  full  mount  protocol 
to  get  file  handles.  Also,  since  files  will  be  cached  in  the  user  process  virtual  memory  space, 
processes  will  have  to  use  some  type  of  coordination  protocol  to  share  cached  files. 


5.4  ISIS 

We  made  the  decision  to  use  ISIS  after  some  consideration;  an  early  version  of  Deceit  did 
not  use  ISIS  at  all.  When  we  used  ISIS,  several  issues  arose: 

•  ISIS  requires  a  separate  configuration  and  installation  phase  in  addition  to  the  one 
required  for  Deceit.  This  requirement  was  an  inconvenience  during  development,  and 
it  will  continue  to  be  a  problem  in  the  future.  A  stand-alone  package  requiring  little 
or  no  additional  configuration  information  would  be  preferable. 

• Deceit exposed several performance and development problems in ISIS. Group membership
change, high volume state transfer, as well as other operations were too
expensive.  Some  ISIS  features,  such  as  partition  tolerance,  were  underdeveloped 
when  Deceit  began  development.  These  issues  are  being  addressed.  Future  versions 
of  ISIS  should  allow  Deceit  to  have  satisfactory  performance. 
• One particular problem was the huge number of process groups that could be generated
by Deceit. In the current implementation of ISIS, process groups are an
expensive  resource.  More  specifically,  ISIS  does  not  efficiently  support  more  than 
100-1000  process  groups.  Future  versions  of  ISIS  should  support  larger  numbers  of 
process  groups.  Future  versions  of  Deceit  will  be  more  careful  with  generating  and 
deleting  process  groups. 

• ISIS saved a large amount of development time. We started to use ISIS after realizing
that we were reimplementing much of the ISIS functionality in Deceit. ISIS
also  provides  useful  debugging  primitives.  We  estimate  that  at  least  6  months  of 
development  was  saved. 

6  Scenarios 

The  previous  sections  provided  a  great  deal  of  detail  about  Deceit  without  much  discussion 
about applications. To show how Deceit could be used to solve real problems, two important
application scenarios are listed below. For each scenario, there is a short description
of  the  scenario,  followed  by  a  description  of  how  Deceit  could  be  used  efficiently  in  this 
application. 


6.1  Academic  Public  Workstation  Environment 

This  environment  is  characterized  by  a  large  number  of  small,  inexpensive,  and  unreliable 
machines.  Administrative  control  is  often  poor.  Users  spend  the  bulk  of  their  time  editing 
or  compiling.  Files  tend  to  be  small,  and  their  physical  location  is  relatively  unimportant, 
but  high  availability  is  valuable. 

This  scenario  is  the  easiest  to  solve  since  Deceit  is  being  developed  and  tuned  in  this 
environment.  All  of  the  semantic  parameter  defaults  should  be  adequate.  Users  will 
typically  want  to  set  the  replication  level  to  2  or  3  on  important  source  and  text  files;  other 
files  can  be  regenerated  if  necessary.  The  system  administrator  should  set  the  replication 
level  to  be  2  or  3  on  all  important  system  directories,  binaries,  and  libraries.  Adding  new 
servers  is  simply  a  matter  of  configuring  ISIS  to  run  on  the  server,  and  executing  the  Deceit 
server  daemon.  Files  can  be  moved  transparently  from  one  server  to  another  by  the  system 
administrator  at  any  time  to  provide  better  disk  balancing. 




6.2  Data  Collection  and  Dispersion 

A  large  class  of  applications  requires  bulk  data  movement  and  manipulation.  For  example, 
NASA  collects  huge  amounts  of  data  at  several  remote  stations  which  is  processed  in  a 
central  computing  facility.  A  product  development  team  uses  large  detailed  specifications 
to  drive  simulations  which  can  be  at  a  distant  location.  This  environment  is  characterized 
by  a  small  number  of  large  machines  with  large  numbers  of  peripheral  machines  attached 
to  them.  Extremely  large  files  are  common.  Users  collect  large  quantities  of  data  on  some 
machines  and  analyze  it  at  other  machines.  Since  file  sizes  often  are  at  the  limits  of  disk 
space,  controlling  the  location  of  the  data  is  necessary.  There  may  be  large  geographical 
distances  within  the  system. 

For a very large data file, the user can turn off automatic localization to prevent uncontrolled
generation of file replicas. Also, the minimum replica level should be 1 until the file
has  reached  its  final  destination,  and  then  it  may  be  set  to  2  to  provide  a  single  backup. 
Since  data  versioning  may  lead  to  version  conflicts,  the  write  availability  level  will  probably 
need  to  be  “medium”  or  even  “low.”  Data  files  can  be  quickly  copied  from  one  server  to 
another  using  the  blast  file  transfer  mechanism  in  Deceit  by  manually  forcing  the  creation 
of  a  replica  on  the  target  server  and  then  deleting  the  replica  on  the  source  server.  At  any 
time  during  the  manipulation  of  the  data  location,  the  file  data  is  available  for  reading  and 
writing  via  any  server. 


7  Conclusions 

We  believe  Deceit  provides  enough  flexibility  so  that  most  applications  can  have  acceptable 
performance  and  availability.  All  the  basic  features  necessary  for  a  full  distributed  file 
system  with  replication  have  been  provided.  Deceit  performance  is  not  understood  if  the 
environment does not satisfy the operational assumptions in Section 2.3. In such cases,
new operational modes may need to be added to Deceit.

A  version  of  Deceit  exists  and  is  used  at  Cornell,  although  it  is  not  yet  in  general  use. 
Except  for  inter-cell  communication,  all  of  the  features  described  above  are  implemented. 
Development  is  still  at  an  early  stage,  and  we  expect  fundamental  architectural  changes 
as  our  experience  continues.  Performance  measures  would  be  premature  at  this  stage  of 
our  effort. 
Two serious problems still need to be addressed. The first is file contention bottlenecks.
Certain  files  and  directories  such  as  the  root  directory  will  be  accessed  very  frequently  by 
all  servers.  It  is  fortunate  that  these  files  tend  to  have  read  only  access.  It  may  be  valuable 
to  have  special  file  modes  which  are  optimized  for  this  combination  of  properties[38,24]. 

Another  problem  is  the  use  of  links  and  directories  as  discussed  in  Section  5.2.  The  current 
solution  is  unsatisfactory,  so  we  are  looking  for  other  solutions.  Hopefully,  a  solution  to  this 
problem  may  also  offer  a  solution  to  the  root  directory  contention  problem.  One  possibility 
that  we  are  investigating  is  the  use  of  file  uplinks  to  allow  non-volatile  directories  to  be 
discarded. 

There  are  some  performance  problems  with  the  process  group  management  protocol. 
Group joins are expensive, and broadcasts are more expensive than they need to be. Also,
using a huge number of ISIS groups has an unacceptable effect on ISIS performance. Since
Deceit uses process groups in a restricted and well defined way, it is inefficient to use general
ISIS groups for each file group. Several methods to improve performance are being
investigated. 